Abstract 

In several standard models of dynamic programming (gambling houses, MDPs, POMDPs), 
we prove the existence of a robust notion of value for the infinitely repeated problem, 
namely the pathwise uniform value. This solves two open problems. First, this shows that 
for any $\varepsilon > 0$, the decision-maker has a pure strategy $\sigma$ which is $\varepsilon$-optimal in any n-stage 
game, provided that n is big enough (this result was only known for behavior strategies, 
that is, strategies which use randomization). Second, the strategy $\sigma$ can be chosen such 
that under the long-run average payoff criterion, the decision-maker obtains at least the 
limit of the n-stage values minus $\varepsilon$. 

Keywords: Dynamic programming, Markov decision processes, Partial observation, Uniform 
value, Long-run average payoff. 
Introduction 

The standard model of Markov Decision Process (or Controlled Markov chain) was introduced 
by Bellman [?] and has been extensively studied since then. In this model, at the beginning 
of every stage, a decision-maker perfectly observes the current state, and chooses an action 
accordingly, possibly randomly. The current state and the selected action determine a stage 
payoff and the law of the next state. There are two standard ways to aggregate the stream of 
payoffs. Given a strictly positive integer n, in the n-stage MDP, the total payoff is the Cesàro 
mean $\frac{1}{n}\sum_{m=1}^{n} g_m$, where $g_m$ is the payoff at stage m. Given $\lambda \in (0,1]$, in the $\lambda$-discounted 
MDP, the total payoff is the $\lambda$-discounted sum $\lambda \sum_{m \geq 1} (1-\lambda)^{m-1} g_m$. The maximum payoff 
that the decision-maker can obtain in the n-stage problem (resp. $\lambda$-discounted problem) is 
denoted by $v_n$ (resp. $v_\lambda$). 
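
As a concrete reading of the two aggregation rules, the following Python sketch evaluates the Cesàro mean and the discounted sum of a finite payoff stream. The function names and the truncation of the discounted sum to finitely many stages are illustrative assumptions, not part of the model.

```python
from typing import Sequence


def cesaro_mean(payoffs: Sequence[float]) -> float:
    """n-stage payoff: (1/n) * sum_{m=1}^{n} g_m."""
    n = len(payoffs)
    if n == 0:
        raise ValueError("need at least one stage")
    return sum(payoffs) / n


def discounted_sum(payoffs: Sequence[float], lam: float) -> float:
    """Truncated discounted payoff: lam * sum_m (1 - lam)^(m-1) * g_m."""
    if not 0.0 < lam <= 1.0:
        raise ValueError("lam must lie in (0, 1]")
    # enumerate starts at 0, which is the exponent m - 1 of stage m
    return lam * sum((1.0 - lam) ** m * g for m, g in enumerate(payoffs))


stream = [1.0, 0.0, 1.0, 1.0]
print(cesaro_mean(stream))          # 0.75
print(discounted_sum(stream, 1.0))  # 1.0: only the first stage counts
```

A small discount factor spreads the weight over many stages, which is why the long-term regime corresponds to n large in the first rule and to a discount factor close to 0 in the second.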

A huge part of the literature investigates long-term MDPs, that is, MDPs which are repeated 
a large number of times. In the n-stage problem (resp. $\lambda$-discounted problem), this corresponds 
to n being large (resp. $\lambda$ being small). A first approach is to determine whether $(v_n)$ and $(v_\lambda)$ 
converge when n goes to infinity and $\lambda$ goes to 0, and whether the two limits coincide. When 
this is the case, the MDP is said to have an asymptotic value. The asymptotic value represents 
the long-term payoff outcome. When the asymptotic value exists, a second approach consists 
in determining if for any $\varepsilon > 0$, there exists a behavior (resp. pure) strategy that is optimal up 
to $\varepsilon$ in any n-stage and $\lambda$-discounted problem, provided that n is big and $\lambda$ is small. When this 
is the case, the MDP is said to have a uniform value in behavior (resp. pure) strategies. 

A third approach is to define the payoff in the infinite problem as being the expectation 
of $\liminf_{n \to +\infty} \frac{1}{n}\sum_{m=1}^{n} g_m$. In the literature, this is referred to as the long-run average payoff 
criterion^1 (AP criterion, see Arapostathis et al. [3] for a review of the subject). We denote by 
$w_\infty$ the maximal payoff that the decision-maker can guarantee under this criterion. Clearly, 
under this criterion, the decision-maker cannot have more than $\liminf_{n \to +\infty} v_n$. A natural 
question is whether he can obtain $\liminf_{n \to +\infty} v_n$. 

When the state space and action sets are finite, Blackwell has proved the existence of a 
pure strategy that is optimal for every discount factor close to 0, and one can deduce that the 
uniform value exists in pure strategies, and that under the AP criterion, the decision-maker 
can obtain $\lim_{n \to +\infty} v_n$. 

In many situations, the decision-maker may not be perfectly informed of the current state 
variable. For instance, if the state variable represents a resource stock (like the amount of oil in 
an oil field), the quantity left, which represents the state, can be evaluated, but is not exactly 
known. This motivates the introduction of the more general model of Partially Observable 
Markov Decision Process (POMDP). In this model, at each stage, the decision-maker does not 
observe the current state, but instead receives a signal which is correlated to it. Rosenberg, 
Solan and Vieille [18] have proved that any POMDP has a uniform value in behavior strategies, 
when the state space, the action set and the signal set are finite. In the proof, the authors 
highlight the need for the decision-maker to resort to behavior strategies, and ask whether 
the uniform value exists in pure strategies. They also raise the question of the behavior of the 
time averages of the payoffs, which is linked to the AP criterion. Renault [15] and Renault and 
Venel [16] have provided two alternative proofs of the existence of the uniform value in behavior 
strategies in POMDPs, and also ask whether the uniform value exists in pure strategies. 

One of the main contributions of this paper is to solve this question positively. We prove 
that POMDPs have a uniform value in pure strategies. Moreover, for all $\varepsilon > 0$, under the AP 
criterion, the decision-maker can obtain $\lim_{n \to +\infty} v_n - \varepsilon$. In fact, we prove this result in a much 
more general framework, as we shall see now. 

The result of Rosenberg, Solan and Vieille [18] (existence of the uniform value in behavior 
strategies in POMDPs) has been generalized in several dynamic programming models with 
infinite state space and action set. The first one is the model of gambling house. 
Introduced by Dubins and Savage [?], a gambling house is defined by a correspondence from 
a metric space $X$ to the set of probabilities on $X$. At every stage, the decision-maker chooses 
a probability on $X$ which is compatible with the correspondence and the current state. A new 
state is drawn from this probability, and this new state determines the stage payoff. When 
the state space is compact, the correspondence is 1-Lipschitz, and the payoff function is 
continuous (for suitable metrics), the existence of the uniform value in behavior strategies stems 
from the main theorem in [15]. One can deduce from this result the existence of the uniform 
value in behavior strategies in MDPs and POMDPs, for a finite state space and any action and 
signal sets. Renault and Venel [16] have extended the results of [15] to more general payoff 
evaluations. 

The proofs in Renault [15] and Renault and Venel [16] are quite different from the one of 
Rosenberg, Solan and Vieille [18]. Still, they heavily rely on the use of behavior strategies for 
the decision-maker, and they do not provide any results concerning the AP criterion. 

In this paper, we consider a gambling house with compact state space, closed graph correspondence 
and continuous payoff function. We show that if the family $\{v_n, n \geq 1\}$ is equicontinuous 
and $w_\infty$ is continuous, the gambling house has a uniform value in pure strategies. 
Moreover, for all $\varepsilon > 0$, the decision-maker can guarantee $\lim_{n \to +\infty} v_n - \varepsilon$ under the AP criterion. 
This result especially applies to 1-Lipschitz gambling houses. We deduce the same result 
for compact MDPs with 1-Lipschitz transition, and POMDPs with finite state space, compact 
action set and finite signal set. 

Note that under an ergodic assumption on the transition function, like assuming that from 
any state, the decision-maker can make the state go back to the initial state (see Altman [?]), 
or assuming that the law of the state variable converges to an invariant measure (see Borkar 
[?]), these results were already known. One remarkable feature of our proof is that we are 
able to use ergodic theory without any ergodic assumptions. 

^1 In some papers, the decision-maker minimizes the cost: in this case, the long-run average payoff criterion 
corresponds to the long-run average cost criterion. 

The paper is organized as follows. The first part presents the model of gambling house and 
recalls usual notions of value. The second part defines pathwise uniform value and states our 
results, that is, the existence of the pathwise uniform value in gambling houses, MDPs and 
POMDPs. The last three parts are dedicated to the proof of these results. 


1 Gambling houses 

1.1 Model of gambling house 

Let us start with a few notations. We denote by $\mathbb{N}^*$ the set of strictly positive integers. If $A$ 
is a measurable space, we denote by $\Delta(A)$ the set of probability measures over $A$. If $(A,d)$ is 
a compact metric space, we will always equip $(A,d)$ with the Borel $\sigma$-algebra, and denote by 
$\mathcal{B}(A)$ the set of Borel subsets of $A$. The set of continuous functions from $A$ to $[0,1]$ is denoted 
by $C(A,[0,1])$. The set $\Delta(A)$ is compact metric for the Kantorovich-Rubinstein distance $d_{KR}$, 
which metrizes the weak* topology. Recall that the distance $d_{KR}$ is defined for all $z$ and $z'$ in 
$\Delta(A)$ by 

$$d_{KR}(z,z') := \sup_{f \in E_1} \left| \int_A f(x)\,z(dx) - \int_A f(x)\,z'(dx) \right| = \inf_{\pi \in \Pi(z,z')} \int_{A \times A} d(x,y)\,\pi(dx,dy),$$

where $E_1 \subset C(A,[0,1])$ is the set of 1-Lipschitz functions from $A$ to $[0,1]$ and $\Pi(z,z') \subset \Delta(A \times A)$ 
is the set of measures on $A \times A$ with first marginal $z$ and second marginal $z'$. Because $A$ is 
compact, the infimum is a minimum. For $f \in C(A,[0,1])$, the linear extension of $f$ is the 
function $\hat{f} \in C(\Delta(A),[0,1])$, defined for $z \in \Delta(A)$ by 

$$\hat{f}(z) := \int_A f(x)\,z(dx).$$
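
For intuition, when $A$ is a finite subset of the real line with $d(x,y) = |x - y|$, the infimum over couplings has a closed form: the integral of the absolute difference between the two cumulative distribution functions. This one-dimensional formula is a standard fact about transport costs, not something stated in the text, and the function name below is hypothetical.

```python
from typing import Dict


def kr_distance_on_line(z: Dict[float, float], z2: Dict[float, float]) -> float:
    """Transport cost between two finitely supported laws on the real line,
    computed as the integral of |CDF_z - CDF_z2|."""
    points = sorted(set(z) | set(z2))
    cost = 0.0
    cdf_gap = 0.0  # CDF_z(x) - CDF_z2(x) on the current interval
    for left, right in zip(points, points[1:]):
        cdf_gap += z.get(left, 0.0) - z2.get(left, 0.0)
        cost += abs(cdf_gap) * (right - left)
    return cost


# Moving half of the mass from 0 to 1 costs 1/2.
print(kr_distance_on_line({0.0: 1.0}, {0.0: 0.5, 1.0: 0.5}))  # 0.5
```

The distance is small exactly when little mass has to travel far, which is the sense in which $d_{KR}$ metrizes weak* convergence on a compact space.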

A gambling house $\Gamma = (X, F, r)$ is defined by the following elements: 

• $X$ is the state space, which is assumed to be compact metric for some distance $d$. 

• $F : (X, d) \rightrightarrows (\Delta(X), d_{KR})$ is a correspondence with a closed graph and nonempty values. 

• $r : X \to [0,1]$ is the payoff function, which is assumed to be continuous. 

Remark 1. Because the state space is compact, F is a closed graph correspondence if and only 
if it is an upper hemicontinuous correspondence with closed values. 

Let $x_0 \in X$ be an initial state. The gambling house starting from $x_0$ proceeds as follows. 
At each stage $m \geq 1$, the decision-maker chooses $z_m \in F(x_{m-1})$. A new state $x_m$ is drawn 
from the probability distribution $z_m$, and the decision-maker gets the payoff $r(x_m)$. 

For the definition of strategies, we follow Maitra and Sudderth [?, Chapter 2]. First, we 
need the following definition (see [?, Chapter 11, section 1.8]): 

Definition 1. Let $\nu \in \Delta(\Delta(X))$. The barycenter of $\nu$ is the probability measure $\mu = \mathrm{Bar}(\nu) \in \Delta(X)$ 
such that for all $f \in C(X,[0,1])$, 

$$\hat{f}(\mu) = \int_{\Delta(X)} \hat{f}(z)\,\nu(dz).$$

Given $M$ a closed subset of $\Delta(X)$, we denote by $\mathrm{Sco}\,M$ the strong convex hull of the set $M$, 
that is, 

$$\mathrm{Sco}\,M := \{\mathrm{Bar}(\nu),\ \nu \in \Delta(M)\}.$$

Equivalently, $\mathrm{Sco}\,M$ is the closure of the convex hull of $M$. 

For every $m \geq 1$, we denote by $H_m := X^m$ the set of possible histories before stage $m$, 
which is compact for the product topology. 

Definition 2. A behavior (resp. pure) strategy $\sigma$ is a sequence of mappings $\sigma := (\sigma_m)_{m \geq 1}$ 
such that for every $m \geq 1$, 

• $\sigma_m : H_m \to \Delta(X)$ is (Borel) measurable, 

• for all $h_m = (x_0, \dots, x_{m-1}) \in H_m$, $\sigma_m(h_m) \in \mathrm{Sco}(F(x_{m-1}))$ (resp. $\sigma_m(h_m) \in F(x_{m-1})$). 

We denote by $\Sigma$ (resp. $\Sigma_p$) the set of behavior (resp. pure) strategies. 

Note that $\Sigma_p \subset \Sigma$. The following proposition ensures that $\Sigma_p$ is nonempty. This is a special 
case of the Kuratowski-Ryll-Nardzewski theorem (see [1, Theorem 18.13, p. 600]). 

Proposition 1. Let $K_1$ and $K_2$ be two compact metric spaces, and $\Phi : K_1 \rightrightarrows K_2$ be a closed 
graph correspondence with nonempty values. Then $\Phi$ admits a measurable selector, that is, there 
exists a measurable mapping $\varphi : K_1 \to K_2$ such that for all $k \in K_1$, $\varphi(k) \in \Phi(k)$. 

Proof. In [1], the theorem is stated for weakly measurable correspondences. By [1, Theorem 
18.10, p. 598] and [1, Theorem 18.20, p. 606], any correspondence satisfying the assumptions 
of the proposition is weakly measurable, thus the proposition holds. □ 

Definition 3. A strategy $\sigma \in \Sigma$ is Markov if there exists a measurable mapping $f : \mathbb{N}^* \times X \to \Delta(X)$ 
such that for every $h_m = (x_0, \dots, x_{m-1}) \in H_m$, $\sigma(h_m) = f(m, x_{m-1})$. When this is the 
case, we identify $\sigma$ with $f$. 

A strategy $\sigma$ is stationary if there exists a measurable mapping $f : X \to \Delta(X)$ such that for 
every $h_m = (x_0, \dots, x_{m-1}) \in H_m$, $\sigma(h_m) = f(x_{m-1})$. When this is the case, we identify $\sigma$ with $f$. 

Let $H_\infty := X^{\mathbb{N}}$ be the set of all possible plays in the gambling house $\Gamma$. By the Kolmogorov 
extension theorem, an initial state $x_0 \in X$ and a behavior strategy $\sigma$ determine a unique 
probability measure over $H_\infty$, denoted by $\mathbb{P}^{x_0}_{\sigma}$. 

Let $x_0 \in X$ and $n \geq 1$. The payoff in the n-stage problem starting from $x_0$ is defined for 
$\sigma \in \Sigma$ by 

$$\gamma_n(x_0,\sigma) := \mathbb{E}^{x_0}_{\sigma}\left( \frac{1}{n} \sum_{m=1}^{n} r_m \right),$$

where $r_m := r(x_m)$ is the payoff at stage $m \in \mathbb{N}^*$. The value $v_n(x_0)$ of this problem is the 
maximum expected payoff with respect to behavior strategies: 

$$v_n(x_0) := \sup_{\sigma \in \Sigma} \gamma_n(x_0,\sigma).$$

By Feinberg [11, Theorem 5.2], any behavior strategy can be identified with a probability measure 
on the set of pure strategies. It follows that the above supremum is reached at a pure 
strategy. 


Remark 2. For $\mu \in \Delta(X)$, one can also define the gambling house with initial distribution $\mu$, 
where the initial state is drawn from $\mu$ and announced to the decision-maker. The definitions of 
strategies and values are the same, and for all $n \in \mathbb{N}^*$, the value of the n-stage gambling house 
starting from $\mu$ is equal to $\hat{v}_n(\mu)$. 
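
When the state space is finite and each set of available distributions is finite, the n-stage values can be computed by backward induction. The sketch below is written for a hypothetical finite gambling house encoded with dictionaries; the recursion it uses (the best total payoff over k stages equals the best expected sum of the next stage payoff and the best total over k - 1 stages) is the standard dynamic programming principle and is an assumption of this illustration, not a statement from the text.

```python
from typing import Dict, List

State = str
Dist = Dict[State, float]  # probability distribution over next states


def n_stage_values(F: Dict[State, List[Dist]],
                   r: Dict[State, float],
                   n: int) -> Dict[State, float]:
    """Return the n-stage value of every state of a finite gambling house.

    The stage payoff is r(x_m), where x_m is the state drawn at stage m,
    so the initial state itself yields no payoff.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    total = {x: 0.0 for x in F}  # best expected total payoff over k stages
    for _ in range(n):
        total = {
            x: max(sum(p * (r[y] + total[y]) for y, p in z.items())
                   for z in F[x])
            for x in F
        }
    return {x: total[x] / n for x in F}


# Hypothetical house: from "a" one may stay or move to "b";
# "b" is absorbing and pays 1.
F = {"a": [{"a": 1.0}, {"b": 1.0}], "b": [{"b": 1.0}]}
r = {"a": 0.0, "b": 1.0}
print(n_stage_values(F, r, 4))  # {'a': 1.0, 'b': 1.0}
```

Each maximization picks a single element of the available set, so the optimal choices computed this way form a pure strategy.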
1.2 Long-term gambling houses 

1.2.1 Uniform value 

Definition 4. Let $x_0 \in X$. The gambling house $\Gamma(x_0)$ has an asymptotic value $v_\infty(x_0) \in [0,1]$ 
if the sequence $(v_n(x_0))_{n \geq 1}$ converges to $v_\infty(x_0)$. 

Definition 5. Let $x_0 \in X$. The gambling house $\Gamma(x_0)$ has a uniform value $v_\infty(x_0) \in [0,1]$ in 
behavior (resp. pure) strategies if it has an asymptotic value $v_\infty(x_0)$ and for every $\varepsilon > 0$, there 
exists $n_0 \in \mathbb{N}^*$ and a behavior (resp. pure) strategy $\sigma$ such that for all $n \geq n_0$, 

$$\gamma_n(x_0,\sigma) \geq v_\infty(x_0) - \varepsilon.$$

Definition 6. A gambling house $\Gamma$ is 1-Lipschitz if its correspondence $F$ is 1-Lipschitz, that 
is, for every $x \in X$, every $u \in F(x)$ and every $y \in X$, there exists $w \in F(y)$ such that 
$d_{KR}(u,w) \leq d(x,y)$. 

Renault and Venel [16] have proved that any 1-Lipschitz gambling house has a uniform value 
in behavior strategies^2. They asked about the existence of the uniform value in pure strategies. 
This is a recurring open problem in the literature. In the framework of POMDPs, this open 
problem already appeared in Rosenberg, Solan and Vieille [18] and in Renault [15]. 

1.2.2 The long-run average payoff criterion 

To study long-term dynamic programming problems, an alternative to the uniform approach 
is to associate a payoff to each infinite history. Given an initial state $x_0 \in X$, the infinitely 
repeated gambling house $\Gamma_\infty(x_0)$ is the problem with strategy set $\Sigma$, and payoff function $\gamma_\infty$ 
defined for all $\sigma \in \Sigma$ by 

$$\gamma_\infty(x_0,\sigma) := \mathbb{E}^{x_0}_{\sigma}\left( \liminf_{n \to +\infty} \frac{1}{n} \sum_{m=1}^{n} r_m \right).$$

In the literature, the above payoff is often referred to as the long-run average payoff criterion (see 
[3]). The value of $\Gamma_\infty(x_0)$ is 

$$w_\infty(x_0) := \sup_{\sigma \in \Sigma} \gamma_\infty(x_0,\sigma).$$

Remark 3. The above supremum may not be reached: there may not exist 0-optimal strategies 
in $\Gamma_\infty(x_0)$ (see for example Rosenberg, Solan and Vieille [18]). 

The following proposition plays a key role in this paper: 

Proposition 2. For all $\varepsilon > 0$, there exist $\varepsilon$-optimal pure strategies in $\Gamma_\infty(x_0)$. 

Proof. Exactly like for the n-stage game, this result is a direct consequence of Theorem 5.2 in 
Feinberg [11]. □ 

If $\Gamma(x_0)$ has a uniform value $v_\infty(x_0)$, we have $w_\infty(x_0) \leq v_\infty(x_0)$ by the dominated convergence 
theorem. A natural question is to ask whether the equality holds. When this is the case, 
it significantly strengthens the notion of uniform value, as shown by the following example. 

^2 In fact, their model of gambling house is slightly different: they do not assume that F is closed-valued, but 
instead assume that it takes values in the set of probability measures on X with finite support. 

Example 1. There are two states, $x$ and $x^*$, and $F(x) = F(x^*) = \{\delta_x, \delta_{x^*}\}$. Moreover, $r(x) = 0$ 
and $r(x^*) = 1$. Thus, at each stage, the decision-maker has to choose between having a payoff 
0 and having a payoff 1. Obviously, this problem has a uniform value equal to 1. Let $\varepsilon > 0$. Let 
$\sigma$ be the strategy such that for all $n \in \mathbb{N}$, at stage $2^{2^n} - 1$, the decision-maker chooses $x$ with 
probability $\varepsilon/2$, and sticks to this choice until stage $2^{2^{n+1}} - 1$; with probability $1 - \varepsilon/2$, he chooses 
$x^*$, and sticks to this choice until stage $2^{2^{n+1}} - 1$. The strategy $\sigma$ is uniformly $\varepsilon$-optimal: there 
exists $n_0 \in \mathbb{N}^*$ such that for all $n \geq n_0$, 

$$\gamma_n(x,\sigma) \geq 1 - \varepsilon.$$

Nonetheless, by the law of large numbers, for any $n_0 \in \mathbb{N}^*$, there exists a random time $T$ such 
that $\mathbb{P}^{x}_{\sigma}$ almost surely, $T \geq n_0$ and 

$$\frac{1}{T} \sum_{m=1}^{T} r_m \leq \varepsilon.$$

Therefore, the strategy $\sigma$ does not guarantee more than $\varepsilon$ in the game $\Gamma_\infty(x)$. 
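
The gap between the expected and the pathwise criteria in Example 1 can be made concrete with exact arithmetic. The sketch below assumes that the blocks on which the decision-maker keeps the same choice start at the doubly exponential stages 2^(2^k) - 1; this block structure and every function name are assumptions of the illustration. It bounds the realized average payoff at the end of a block that pays 0, in the most favorable case where every earlier stage paid 1.

```python
from fractions import Fraction


def block_start(k: int) -> int:
    """First stage of block k (assumed doubly exponential: 2^(2^k) - 1)."""
    return 2 ** (2 ** k) - 1


def worst_average_after_bad_block(k: int) -> Fraction:
    """Upper bound on the realized average payoff at the end of block k
    when block k pays 0: at best, every earlier stage paid 1."""
    good_stages = block_start(k) - 1      # stages 1 .. block_start(k) - 1
    horizon = block_start(k + 1) - 1      # last stage of block k
    return Fraction(good_stages, horizon)


def expected_stage_payoff(eps: Fraction) -> Fraction:
    """Each block pays 1 with probability 1 - eps/2 and 0 otherwise."""
    return 1 - eps / 2


eps = Fraction(1, 10)
print(expected_stage_payoff(eps))  # 19/20
for k in range(1, 4):
    print(k, worst_average_after_bad_block(k))
```

Under these assumptions the expected payoff of every stage stays above 1 - eps, while a single unlucky block far enough along the play drags the realized average below eps. Since each block is unlucky with the same positive probability, this happens infinitely often along almost every play.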


2 Main results 

2.1 Gambling houses 

We introduce a stronger notion of uniform value, which allows us to deal with the two open 
questions mentioned in the previous section at the same time. 

Definition 7. Let $x_0 \in X$. The gambling house $\Gamma(x_0)$ has a pathwise uniform value in behavior 
(resp. pure) strategies if 

• The gambling house $\Gamma(x_0)$ has an asymptotic value $v_\infty(x_0)$. 

• For all $\varepsilon > 0$, there exists a behavior (resp. pure) strategy $\sigma$ such that 

$$\gamma_\infty(x_0,\sigma) \geq v_\infty(x_0) - \varepsilon.$$

A strategy $\sigma$ satisfying the above inequality is called a pathwise $\varepsilon$-optimal strategy. When for all 
$x_0 \in X$, $\Gamma(x_0)$ has a pathwise uniform value in behavior (resp. pure) strategies, we say that $\Gamma$ 
has a pathwise uniform value in behavior (resp. pure) strategies. 

Proposition 2 implies that there exists a pathwise uniform value in behavior strategies if 
and only if there exists a pathwise uniform value in pure strategies. The following proposition 
shows that the concept of pathwise uniform value is stronger than the concept of uniform 
value. 


Proposition 3. Assume that $\Gamma(x_0)$ has a pathwise uniform value (in behavior or pure strategies). 
Then it has a uniform value in pure strategies. 

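Proof. By Proposition 2, $\Gamma(x_0)$ has a pathwise uniform value in pure strategies. Let $\varepsilon > 0$, and 
$\sigma$ be a pathwise $\varepsilon$-optimal pure strategy. We have 

$$\mathbb{E}^{x_0}_{\sigma}\left( \liminf_{n \to +\infty} \frac{1}{n} \sum_{m=1}^{n} r_m \right) \geq v_\infty(x_0) - \varepsilon.$$

By Fatou's lemma, it follows that 

$$\liminf_{n \to +\infty} \mathbb{E}^{x_0}_{\sigma}\left( \frac{1}{n} \sum_{m=1}^{n} r_m \right) \geq v_\infty(x_0) - \varepsilon,$$

and the gambling house $\Gamma(x_0)$ has a uniform value in pure strategies. □ 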


We can now state our main theorem concerning gambling houses: 
Theorem 1. Let $\Gamma$ be a gambling house such that $\{v_n, n \geq 1\}$ is uniformly equicontinuous and 
$w_\infty$ is continuous. Then $\Gamma$ has a pathwise uniform value in pure strategies. Consequently, it 
has a uniform value in pure strategies, and 

$$v_\infty = w_\infty.$$

In particular, we obtain the following result. 

Theorem 2. Let $\Gamma$ be a 1-Lipschitz gambling house. Then $\Gamma$ has a pathwise uniform value in 
pure strategies. Consequently, it has a uniform value in pure strategies, and 

$$v_\infty = w_\infty.$$

In the next two subsections, we present similar results for MDPs and POMDPs. 

2.2 MDPs 

A Markov Decision Process (MDP) is a 4-tuple $\Gamma = (K, I, g, q)$, where $(K, d_K)$ is a compact 
metric state space, $(I, d_I)$ is a compact metric action set, $g : K \times I \to [0,1]$ is a continuous 
payoff function, and $q : K \times I \to \Delta(K)$ is a continuous transition function. As usual, the set 
$\Delta(K)$ is equipped with the KR metric, and we assume that for all $i \in I$, $q(\cdot, i)$ is 1-Lipschitz. 
Given an initial state $k_1 \in K$ known by the decision-maker, the MDP $\Gamma(k_1)$ proceeds as follows. 
At each stage $m \geq 1$, the decision-maker chooses $i_m \in I$, and gets the payoff $g_m := g(k_m, i_m)$. 
A new state $k_{m+1}$ is drawn from $q(k_m, i_m)$, and is announced to the decision-maker. Then, 
$\Gamma(k_1)$ moves on to stage $m + 1$. A behavior (resp. pure) strategy is a measurable map 
$\sigma : \cup_{m \geq 1} K \times (I \times K)^{m-1} \to \Delta(I)$ (resp. $\sigma : \cup_{m \geq 1} K \times (I \times K)^{m-1} \to I$). An initial state $k_1$ and 
a strategy $\sigma$ induce a probability measure $\mathbb{P}^{k_1}_{\sigma}$ on the set of plays $H_\infty = (K \times I)^{\mathbb{N}^*}$. 

The notion of uniform value is defined in the same way as in gambling houses. We prove 
the following theorem: 

Theorem 3. The MDP $\Gamma$ has a pathwise uniform value in pure strategies, that is, for all 
$k_1 \in K$, the two following statements hold: 

• The sequence $(v_n(k_1))$ converges when n goes to infinity to some real number $v_\infty(k_1)$. 

• For all $\varepsilon > 0$, there exists a pure strategy $\sigma$ such that 

$$\mathbb{E}^{k_1}_{\sigma}\left( \liminf_{n \to +\infty} \frac{1}{n} \sum_{m=1}^{n} g(k_m,i_m) \right) \geq v_\infty(k_1) - \varepsilon.$$

Consequently, the MDP $\Gamma$ has a uniform value in pure strategies. 

2.3 POMDPs 

A Partially Observable Markov Decision Process (POMDP) is a 5-tuple $\Gamma = (K, I, S, g, q)$, where 
$K$ is a finite state space, $I$ is a compact metric action set, $S$ is a finite signal set, $g : K \times I \to [0,1]$ 
is a continuous payoff function, and $q : K \times I \to \Delta(K \times S)$ is a continuous transition function. 
Given an initial distribution $p_1 \in \Delta(K)$, the POMDP $\Gamma(p_1)$ proceeds as follows. An initial 
state $k_1$ is drawn from $p_1$, and the decision-maker is not informed about it. At each stage 
$m \geq 1$, the decision-maker chooses $i_m \in I$, and gets the (unobserved) payoff $g(k_m, i_m)$. A pair 
$(k_{m+1}, s_m)$ is drawn from $q(k_m, i_m)$, and the decision-maker receives the signal $s_m$. Then the 
game proceeds to stage $m + 1$. A behavior strategy (resp. pure strategy) is a measurable map 
$\sigma : \cup_{m \geq 1} (I \times S)^{m-1} \to \Delta(I)$ (resp. $\sigma : \cup_{m \geq 1} (I \times S)^{m-1} \to I$). An initial distribution $p_1 \in \Delta(K)$ 
and a strategy $\sigma$ induce a probability measure $\mathbb{P}^{p_1}_{\sigma}$ on the set of plays $H_\infty := (K \times I \times S)^{\mathbb{N}^*}$. 
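
Although the decision-maker never observes the state, the initial distribution together with the actions played and the signals received determines a posterior belief on the current state. The following sketch shows this Bayesian update for finite state and signal sets and one fixed action; the dictionary encoding of the transition and the function name are hypothetical, chosen for illustration only.

```python
from typing import Dict, Hashable, Tuple

State = Hashable
Signal = Hashable
Belief = Dict[State, float]
# kernel[k][(k_next, s)] = probability of (k_next, s) given state k,
# for one fixed action i, i.e. the distribution q(k, i).
Kernel = Dict[State, Dict[Tuple[State, Signal], float]]


def update_belief(p: Belief, kernel: Kernel, signal: Signal) -> Belief:
    """Posterior over the next state after observing `signal`."""
    joint: Belief = {}
    for k, weight in p.items():
        for (k_next, s), prob in kernel[k].items():
            if s == signal:
                joint[k_next] = joint.get(k_next, 0.0) + weight * prob
    total = sum(joint.values())
    if total == 0.0:
        raise ValueError("signal has probability zero under this belief")
    return {k: v / total for k, v in joint.items()}


# Hypothetical example: the state never changes and the signal reports it
# correctly with probability 3/4.
kernel = {
    "L": {("L", "l"): 0.75, ("L", "r"): 0.25},
    "R": {("R", "l"): 0.25, ("R", "r"): 0.75},
}
print(update_belief({"L": 0.5, "R": 0.5}, kernel, "l"))  # {'L': 0.75, 'R': 0.25}
```

Tracking the belief turns the partially observed problem into a fully observed one on the set of distributions over states.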

The notion of uniform value is defined in the same way as in gambling houses. We prove 
the following theorem: 
Theorem 4. The POMDP $\Gamma$ has a pathwise uniform value in pure strategies, that is, for all 
$p_1 \in \Delta(K)$, the two following statements hold: 

• The sequence $(v_n(p_1))$ converges when n goes to infinity to some real number $v_\infty(p_1)$. 

• For all $\varepsilon > 0$, there exists a pure strategy $\sigma$ such that 

$$\mathbb{E}^{p_1}_{\sigma}\left( \liminf_{n \to +\infty} \frac{1}{n} \sum_{m=1}^{n} g(k_m,i_m) \right) \geq v_\infty(p_1) - \varepsilon.$$

Consequently, the POMDP $\Gamma$ has a uniform value in pure strategies. 

In particular, this theorem solves positively the open question mentioned in [18], [15] and 
[16]: finite POMDPs have a uniform value in pure strategies. 


3 Proof of Theorem 1 

Let $\Gamma = (X, F, r)$ be a gambling house such that $\{v_n, n \geq 1\} \cup \{w_\infty\}$ is uniformly equicontinuous. 
Let $v : X \to [0,1]$ be defined by $v := \limsup_{n \to +\infty} v_n$. 

Let $x_0 \in X$ be an initial state. By Proposition 2, in order to prove Theorem 1, it is sufficient 
to prove that for all $\varepsilon > 0$, there exists a behavior strategy $\sigma$ such that 

$$\gamma_\infty(x_0,\sigma) = \mathbb{E}^{x_0}_{\sigma}\left( \liminf_{n \to +\infty} \frac{1}{n} \sum_{m=1}^{n} r_m \right) \geq v(x_0) - \varepsilon.$$

Let us first give the structure and the intuition of the proof. It builds on three main ideas, each 
of them corresponding to a lemma. 

First, Lemma 1 associates to $x_0$ a probability measure $\mu^* \in \Delta(X)$, such that: 

• Going from $x_0$, for all $\varepsilon > 0$ and $n_0 \in \mathbb{N}^*$, there exists a strategy $\sigma_0$ and $n \geq n_0$ such that 
the occupation measure $\frac{1}{n} \sum_{m=1}^{n} z_m(x_0,\sigma_0) \in \Delta(X)$, where $z_m(x_0,\sigma_0)$ is the law of $x_m$, is close to $\mu^*$ up to $\varepsilon$ (for the KR distance). 

• $\hat{r}(\mu^*) = \hat{v}(\mu^*) = v(x_0)$. 

• If the initial state is drawn according to $\mu^*$, the decision-maker has a behavior stationary 
strategy $\sigma^*$ such that for all $m \geq 1$, $x_m$ is distributed according to $\mu^*$ ($\mu^*$ is an invariant 
measure for the gambling house). 

Let $x$ be in the support of $\mu^*$. Building on a pathwise ergodic theorem, Lemma 2 shows that 

$$\frac{1}{n} \sum_{m=1}^{n} r_m \xrightarrow[n \to +\infty]{} v(x) \quad \mathbb{P}^{x}_{\sigma^*}\text{-a.s.}$$

Lemma 3 shows that, if $y \in X$ is close to $x$, then there exists a 
behavior strategy $\sigma$ such that $\gamma_\infty(y,\sigma)$ is close to $v(y)$. 

These lemmas are put together in the following way. Lemma 1 implies that, going from 
$x_0$, the decision-maker has a strategy $\sigma_0$ such that there exists a (deterministic) stage $m \geq 1$ 
such that with high probability, the state $x_m$ is close to the support of $\mu^*$, and such that the 
expectation of $v(x_m)$ is close to $v(x_0)$. Let $x$ be an element in the support of $\mu^*$ such that 
$x_m$ is close to $x$. By Lemma 3, going from $x_m$, the decision-maker has a strategy $\sigma'$ such that 
$\gamma_\infty(x_m,\sigma')$ is close to $v(x_m)$. Let $\sigma$ be the strategy that plays $\sigma_0$ until stage $m$, then switches 
to $\sigma'$. Then $\gamma_\infty(x_0,\sigma)$ is close to $v(x_0)$, which concludes the proof of Theorem 1. 




3.1 Preliminary results 

Let $\Gamma = (X, F, r)$ be a gambling house. We define a relaxed version of the gambling house, in 
order to obtain a deterministic convex gambling house $H : \Delta(X) \rightrightarrows \Delta(X)$. The interpretation 
of $H(z)$ is the following: if the initial state is drawn according to $z$, $H(z)$ is the set of all possible 
measures on the next state that the decision-maker can generate by using behavior strategies. 
First, we define $G : X \rightrightarrows \Delta(X)$ by 

$$\forall x \in X, \quad G(x) := \mathrm{Sco}(F(x)).$$

By [1, Theorem 17.35, p. 573], the correspondence $G$ has a closed graph, which is denoted 
by Graph $G$. Note that a behavior strategy in the gambling house $\Gamma$ corresponds to a pure 
strategy in the gambling house $(X, G, r)$. For every $z \in \Delta(X)$, we define $H(z)$ by 

$$H(z) := \left\{ \mu \in \Delta(X) \;\middle|\; \exists\, \sigma : X \to \Delta(X) \text{ measurable s.t. } \forall x \in X,\ \sigma(x) \in G(x) \text{ and } \forall f \in C(X,[0,1]),\ \hat{f}(\mu) = \int_X \hat{f}(\sigma(x))\,z(dx) \right\}.$$

Note that replacing "$\forall x \in X$, $\sigma(x) \in G(x)$" by "$\sigma(x) \in G(x)$ $z$-a.s." does not change 
the above definition (throughout the paper, "a.s." stands for "almost surely"). 

By Proposition 1, $H$ has nonempty values. We now check that the correspondence $H$ has a 
closed graph. 

Proposition 4. The correspondence H has a closed graph. 

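Proof. Let $(z_n, \mu_n)_{n \in \mathbb{N}} \in (\mathrm{Graph}\, H)^{\mathbb{N}}$ such that $(z_n, \mu_n)_{n \in \mathbb{N}}$ converges to some $(z, \mu) \in \Delta(X) \times \Delta(X)$. 
Let us show that $\mu \in H(z)$. For this, we construct $\sigma : X \to \Delta(X)$ associated to $\mu$ in 
the definition of $H(z)$. 

By definition of $H$, for every $n \in \mathbb{N}$, there exists $\sigma_n : X \to \Delta(X)$ a measurable selector of 
$G$ such that for every $f \in C(X,[0,1])$, 

$$\hat{f}(\mu_n) = \int_X \hat{f}(\sigma_n(x))\,z_n(dx).$$

Let $\pi_n \in \Delta(\mathrm{Graph}\, G)$ such that the first marginal of $\pi_n$ is $z_n$, and the conditional distribution 
of $\pi_n$ knowing $x \in X$ is $\delta_{\sigma_n(x)} \in \Delta(\Delta(X))$. By definition, for every $f \in C(X,[0,1])$, we have 

$$\int_{X \times \Delta(X)} \hat{f}(p)\,\pi_n(dx,dp) = \int_X \left( \int_{\Delta(X)} \hat{f}(p)\,\delta_{\sigma_n(x)}(dp) \right) z_n(dx) = \int_X \hat{f}(\sigma_n(x))\,z_n(dx) = \hat{f}(\mu_n).$$

The set $\Delta(\mathrm{Graph}\, G)$ is compact, thus there exists $\pi$ a limit point of the sequence $(\pi_n)_{n \in \mathbb{N}}$. 
By definition of the weak* topology on $\Delta(X)$ and on $\Delta(\mathrm{Graph}\, G)$, the previous equation yields 

$$\int_{X \times \Delta(X)} \hat{f}(p)\,\pi(dx,dp) = \hat{f}(\mu). \qquad (1)$$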

To conclude, let us disintegrate $\pi$. Let $z'$ be the first marginal of $\pi$. The sets $X$ and $\Delta(X)$ 
are compact metric spaces, thus there exists a probability kernel $K : X \times \mathcal{B}(\Delta(X)) \to [0,1]$ 
such that 

• for every $x \in X$, $K(x, \cdot) \in \Delta(\Delta(X))$, 

• for every $B \in \mathcal{B}(\Delta(X))$, $K(\cdot, B)$ is measurable, 

• for every $h \in C(X \times \Delta(X), [0,1])$, 

$$\int_{X \times \Delta(X)} h(x,p)\,\pi(dx,dp) = \int_X \left( \int_{\Delta(X)} h(x,p)\,K(x,dp) \right) z'(dx). \qquad (2)$$

Note that the second condition is equivalent to: "The mapping $x \mapsto K(x, \cdot)$ is measurable" 
(see [?, Proposition 7.26, p. 134]). For every $n \geq 1$, the first marginal of $\pi_n$ is equal to $z_n$ 
that converges to $z$, thus $z' = z$. Define a measurable mapping $\sigma : X \to \Delta(X)$ by $\sigma(x) := \mathrm{Bar}(K(x, \cdot)) \in \Delta(X)$. 
Because $\pi \in \Delta(\mathrm{Graph}\, G)$, we have $\sigma(x) \in G(x)$ $z$-a.s. Let $f \in C(X,[0,1])$. 
Using successively (1) and (2) yields 

$$\hat{f}(\mu) = \int_{X \times \Delta(X)} \hat{f}(p)\,\pi(dx,dp) = \int_X \left( \int_{\Delta(X)} \hat{f}(p)\,K(x,dp) \right) z(dx) = \int_X \hat{f}(\sigma(x))\,z(dx).$$

Thus, $\mu \in H(z)$, and $H$ has a closed graph. □ 


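Let $\mu, \mu' \in \Delta(X)$ and $\lambda \in [0,1]$. Denote by $\lambda \cdot \mu + (1-\lambda) \cdot \mu'$ the probability measure $\mu'' \in \Delta(X)$ such that 
for all $f \in C(X,[0,1])$, 

$$\hat{f}(\mu'') = \lambda \hat{f}(\mu) + (1-\lambda) \hat{f}(\mu').$$

For $(\mu_m)_{m \in \mathbb{N}^*} \in \Delta(X)^{\mathbb{N}^*}$ and $n \in \mathbb{N}^*$, the measure $\frac{1}{n} \sum_{m=1}^{n} \mu_m$ is defined in a similar way. 

Proposition 5. The correspondence $H$ is linear on $\Delta(X)$: 

$$\forall z, z' \in \Delta(X),\ \forall \lambda \in [0,1], \quad H(\lambda \cdot z + (1-\lambda) \cdot z') = \lambda \cdot H(z) + (1-\lambda) \cdot H(z').$$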

Proof. Let $z, z' \in \Delta(X)$ and $\lambda \in [0,1]$, then the inclusion 

$$H(\lambda \cdot z + (1-\lambda) \cdot z') \subset \lambda \cdot H(z) + (1-\lambda) \cdot H(z')$$

is immediate. We now prove the converse inclusion. Let $\mu \in \lambda \cdot H(z) + (1-\lambda) \cdot H(z')$. By 
definition, there exist $\sigma : X \to \Delta(X)$ and $\sigma' : X \to \Delta(X)$ two measurable selectors of $G$ such 
that for every $f \in C(X,[0,1])$, 

$$\hat{f}(\mu) = \lambda \int_X \hat{f}(\sigma(x))\,z(dx) + (1-\lambda) \int_X \hat{f}(\sigma'(x))\,z'(dx).$$

Denote by $\pi$ (resp. $\pi'$) the probability distribution on $X \times \Delta(X)$ generated by $z$ and $\sigma$ (resp. 
$z'$ and $\sigma'$). Let $\pi'' := \lambda \cdot \pi + (1-\lambda) \cdot \pi'$, then $\pi''$ is a probability on $X \times \Delta(X)$ such that 
$\pi''(\mathrm{Graph}(G)) = 1$, and the marginal on $X$ is $z'' := \lambda \cdot z + (1-\lambda) \cdot z'$. Let $\sigma'' : X \to \Delta(X)$ be given 
by the disintegration of $\pi''$ with respect to the first coordinate. Let $f \in C(X,[0,1])$. As in the 
proof of Proposition 4 (see Equation (2)), we have 

$$\hat{f}(\mu) = \lambda \int_{X \times \Delta(X)} \hat{f}(p)\,\pi(dx,dp) + (1-\lambda) \int_{X \times \Delta(X)} \hat{f}(p)\,\pi'(dx,dp) = \int_{X \times \Delta(X)} \hat{f}(p)\,\pi''(dx,dp) = \int_X \hat{f}(\sigma''(x))\,z''(dx),$$

thus $\mu \in H(\lambda \cdot z + (1-\lambda) \cdot z')$. □ 


3.2 Invariant measure 

The first lemma associates a fixed point of the correspondence $H$ to each initial state: 

Lemma 1. Let $x_0 \in X$. There exists a distribution $\mu^* \in \Delta(X)$ such that 

• $\mu^*$ is $H$-invariant: $\mu^* \in H(\mu^*)$, 

• for every $\varepsilon > 0$ and $N \geq 1$, there exists a (pure) strategy $\sigma_0$ and $n \geq N$ such that $\sigma_0$ is 
0-optimal in $\Gamma_n(x_0)$, $v_n(x_0) \geq v(x_0) - \varepsilon$ and 

$$d_{KR}\left( \frac{1}{n} \sum_{m=1}^{n} z_m(x_0,\sigma_0),\ \mu^* \right) \leq \varepsilon,$$

where $z_m(x_0,\sigma_0) \in \Delta(X)$ is the distribution of $x_m$, the state at stage $m$, given the initial 
state $x_0$ and the strategy $\sigma_0$. 

• $\hat{r}(\mu^*) = \hat{v}(\mu^*) = v(x_0)$. 

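Proof. The proof builds on the same ideas as in Renault and Venel [16, Proposition 3.24, p. 
28]. Let $n \in \mathbb{N}^*$ and $\sigma_0$ be a pure optimal strategy in the n-stage problem $\Gamma_n(x_0)$. 
Let 

$$z_n := \frac{1}{n} \sum_{m=1}^{n} z_m(x_0,\sigma_0) \quad \text{and} \quad z'_n := \frac{1}{n} \sum_{m=2}^{n+1} z_m(x_0,\sigma_0).$$

By construction, for every $m \in \{1, 2, \dots, n\}$, $z_{m+1}(x_0,\sigma_0) \in H(z_m(x_0,\sigma_0))$, therefore by 
linearity of $H$ (see Proposition 5), 

$$z'_n \in H(z_n).$$

Moreover, we have 

$$d_{KR}(z_n, z'_n) \leq \frac{1}{n}\,\mathrm{diam}(X), \qquad (3)$$

where diam(X) is the diameter of X. 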
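
Inequality (3) can be checked numerically in the simplest setting: a finite state space and a fixed stationary transition matrix P, both assumed only for this illustration, as are all names in the code. In that case the shifted average is the image of the average under P, the two averages differ by the first law minus the last law divided by n, and their total variation distance is therefore at most 1/n.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def step(z: Vector, P: Matrix) -> Vector:
    """Law of the next state when the current law is z."""
    size = len(z)
    return [sum(z[i] * P[i][j] for i in range(size)) for j in range(size)]


def occupation_measures(z0: Vector, P: Matrix, n: int) -> Tuple[Vector, Vector]:
    """Averages of the laws of x_1..x_n and of x_2..x_{n+1}."""
    laws = []
    z = z0
    for _ in range(n + 1):
        z = step(z, P)
        laws.append(z)  # laws[m - 1] is the law of x_m
    size = len(z0)
    z_n = [sum(law[j] for law in laws[:n]) / n for j in range(size)]
    z_n_shift = [sum(law[j] for law in laws[1:]) / n for j in range(size)]
    return z_n, z_n_shift


def total_variation(a: Vector, b: Vector) -> float:
    return 0.5 * sum(abs(x - y) for x, y in zip(a, b))


P = [[0.9, 0.1], [0.5, 0.5]]  # hypothetical two-state chain
z_n, z_n_shift = occupation_measures([1.0, 0.0], P, 200)
print(total_variation(z_n, z_n_shift) <= 1 / 200)  # True
```

As n grows, the average becomes almost invariant under one step of the dynamics, which is the mechanism behind the fixed point obtained in the limit.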

The set $\Delta(X)$ is compact. Up to taking a subsequence, there exists $\mu^* \in \Delta(X)$ such that 
$(v_n(x_0))$ converges to $v(x_0)$ and $(z_n)$ converges to $\mu^*$. By inequality (3), $(z'_n)$ also converges to 
$\mu^*$. Because $H$ has a closed graph, we have $\mu^* \in H(\mu^*)$, and $\mu^*$ is $H$-invariant. By construction, 
the second property is immediate. 


Finally, we have a series of inequalities that imply the third property. 

• $v$ is decreasing in expectation along trajectories: the sequence 
$(\hat{v}(z_m(x_0,\sigma_0)))_{m \geq 1}$ is decreasing, thus for every $n \geq 1$, 

$$v(x_0) \geq \frac{1}{n} \sum_{m=1}^{n} \hat{v}(z_m(x_0,\sigma_0)) = \hat{v}(z_n).$$

Taking n to infinity, by continuity of $\hat{v}$, we obtain that $v(x_0) \geq \hat{v}(\mu^*)$. 

• We showed that $\mu^* \in H(\mu^*)$. Let $\sigma^* : X \to \Delta(X)$ be the corresponding measurable 
selector of $G$. Let us consider the gambling house $\Gamma(\mu^*)$ where the initial state is drawn 
from $\mu^*$ and announced to the decision-maker (see Remark 2). The map $\sigma^*$ is a stationary 
strategy in $\Gamma(\mu^*)$, and for all $m \geq 1$, the state $x_m$ is distributed according to $\mu^*$. Consequently, for all $n \in \mathbb{N}^*$, the 
strategy $\sigma^*$ guarantees $\hat{r}(\mu^*)$ in $\Gamma_n(\mu^*)$. Thus, we have 

$$\hat{v}(\mu^*) \geq \hat{r}(\mu^*).$$

• By construction, the payoff is linear on $\Delta(X)$ and $\hat{r}(z_n) = v_n(x_0)$. By continuity of $\hat{r}$, 
taking n to infinity, we obtain 

$$\hat{r}(\mu^*) = v(x_0).$$

□ 

In the next section, we prove that in $\Gamma(\mu^*)$, under the strategy $\sigma^*$, the average payoffs 
converge almost surely to $v(x)$, where $x$ is the initial (random) state. 


3.3 Pathwise ergodic theorem 

We recall here the ergodic theorem in Hernandez-Lerma and Lasserre [?, Theorem 2.5.1, p. 
37]. 

Theorem 5 (pathwise ergodic theorem). Let $(X, \mathcal{B})$ be a measurable space, and $\xi$ be a Markov 
chain on $(X, \mathcal{B})$, with transition probability function $P$. Let $\mu$ be an invariant probability measure 
for $P$. For every $f$ an integrable function with respect to $\mu$, there exist a set $B_f \in \mathcal{B}$ and 
a function $f^*$ integrable with respect to $\mu$ such that $\mu(B_f) = 1$, and for all $x \in B_f$, 

$$\frac{1}{n} \sum_{m=1}^{n} f(\xi_m) \xrightarrow[n \to +\infty]{} f^*(\xi_0) \quad P_x\text{-a.s.}$$

Moreover, 

$$\int_X f^*(x)\,\mu(dx) = \int_X f(x)\,\mu(dx).$$


Lemma 2. Let $x_0 \in X$ and $\mu^* \in \Delta(X)$ be the corresponding invariant measure (see Lemma 
1). There exist a measurable set $B \subset X$ such that $\mu^*(B) = 1$ and a stationary strategy 
$\sigma^* : X \to \Delta(X)$ such that for all $x \in B$, 

$$\frac{1}{n} \sum_{m=1}^{n} r_m \xrightarrow[n \to +\infty]{} v(x) \quad \mathbb{P}^{x}_{\sigma^*}\text{-a.s.}$$

Proof. Because $\mu^*$ is a fixed point of $H$, there exists $\sigma^* : X \to \Delta(X)$ a measurable selector of 
$G$ (thus, a behavior stationary strategy in $\Gamma$) such that for all $f \in C(X,[0,1])$, 

$$\hat{f}(\mu^*) = \int_X \hat{f}(\sigma^*(x))\,\mu^*(dx).$$

Consider the gambling house $\Gamma(\mu^*)$. Under $\sigma^*$, the sequence of states $(x_m)_{m \in \mathbb{N}}$ is a Markov 
chain with invariant measure $\mu^*$. From Theorem 5, there exist a measurable set $B_0 \subset X$ such 
that $\mu^*(B_0) = 1$, and a measurable map $w : X \to [0,1]$ such that for all $x \in B_0$, we have 

$$\frac{1}{n} \sum_{m=1}^{n} r_m \xrightarrow[n \to +\infty]{} w(x) \quad \mathbb{P}^{x}_{\sigma^*}\text{ almost surely,}$$

and 

$$\hat{w}(\mu^*) = \hat{r}(\mu^*).$$
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We now prove that w = v P^. — a.s.. First, we prove that w ^ v P^. — a.s.. Let x G Bo- 
Using first the dominated convergence theorem, then the definition of Vn{x), we have 

w{x) = lim 

\ n—>+oo n ^' / 

\ rn=l / 

= ^ ] 

n—n-oo \ n ^—' / 

\ m=l / 

^ limsup?;n(a;) = v(x). 

n—^+oc 

Moreover, we know by Lemma[I]that = t)(/i*), therefore 

w{fi*) =f{fi*) =v{n*). 

This implies that w = v Pj^. — a.s., and the lemma is proved. □ 


3.4 Junction lemma 

By assumption, $\{v_n, n \ge 1\} \cup \{w_\infty\}$ is uniformly equicontinuous. Therefore, there exists an increasing modulus of continuity $\eta : \mathbb{R}_+ \to \mathbb{R}_+$ such that

$$\forall x, y \in X, \quad |w_\infty(x) - w_\infty(y)| \le \eta(d(x,y)),$$

and for all $n \ge 1$,

$$\forall x, y \in X, \quad |v_n(x) - v_n(y)| \le \eta(d(x,y)).$$

Then, $v$ is also uniformly continuous with the same modulus of continuity.

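Lemma 3. Let $\varepsilon > 0$, $x, y \in X$ and $\sigma^*$ be a strategy such that

$$\frac{1}{n} \sum_{m=1}^{n} r_m \xrightarrow[n \to +\infty]{} v(x) \quad P^{x}_{\sigma^*}\text{-a.s.}$$

Then there exists a strategy $\sigma$ such that

$$E^{y}_{\sigma}\left( \liminf_{n \to +\infty} \frac{1}{n} \sum_{m=1}^{n} r_m \right) \ge v(y) - 2\eta(d(x,y)) - \varepsilon.$$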

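Proof. By assumption, we have

$$E^{x}_{\sigma^*}\left( \liminf_{n \to +\infty} \frac{1}{n} \sum_{m=1}^{n} r_m \right) = E^{x}_{\sigma^*}(v(x)) = v(x),$$

therefore $v(x) \le w_\infty(x)$. Moreover, by Fatou's lemma, $w_\infty(x) \le v(x)$. Thus, $w_\infty(x) = v(x)$.

By definition of $w_\infty(y)$, there exists a strategy $\sigma$ such that

$$
\begin{aligned}
E^{y}_{\sigma}\left( \liminf_{n \to +\infty} \frac{1}{n} \sum_{m=1}^{n} r_m \right) &\ge w_\infty(y) - \varepsilon \\
&\ge w_\infty(x) - \eta(d(x,y)) - \varepsilon \\
&= v(x) - \eta(d(x,y)) - \varepsilon \\
&\ge v(y) - 2\eta(d(x,y)) - \varepsilon.
\end{aligned}
$$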


□

We can now finish the proof of Theorem 1.


3.5 Conclusion of the proof

Proof of Theorem 1. We can now put Lemmas 1, 2 and 3 together to finish the proof of Theorem 1. Fix an initial state $x_0 \in X$ and $\varepsilon > 0$. We will define a strategy $\sigma$ as follows: start by following a strategy $\sigma_0$ until some stage $n_3$, then switch to another strategy depending on the state $x_{n_3}$. We first define the stage $n_3$, then build the strategy $\sigma$ and finally check that this strategy indeed guarantees a good long-run average payoff.

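By assumption, the family $(v_n)_{n \ge 1}$ is uniformly equicontinuous. Consequently, there exists $n_0 \in \mathbb{N}^*$ such that for all $n \ge n_0$ and for all $x \in X$,

$$v_n(x) \le v(x) + \varepsilon.$$

We first consider Lemma 1 for $x_0$, $\varepsilon' = \varepsilon^3$ and $N = 2n_0$. There exist an invariant measure $\mu^*$, a (pure) strategy $\sigma_0$ and $n_1 \ge 2n_0$ such that $\mu^*$ satisfies the conclusion of Lemma 1 and

$$d_{KR}\left( \frac{1}{n_1} \sum_{m=1}^{n_1} z_m(x_0, \sigma_0),\ \mu^* \right) \le \varepsilon^3.$$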

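Let $B$ be given by Lemma 2. In general, there is no hope to prove the existence of a stage $m$ such that $z_m(x_0, \sigma_0)$ is close to $\mu^*$. Instead, we prove the existence of a stage $n_3$ such that under the strategy $\sigma_0$, $x_{n_3}$ is with high probability close to $B$, and $\hat v(z_{n_3}(x_0, \sigma_0))$ is close to $v(x_0)$.

Let $n_2 = \lfloor \varepsilon n_1 \rfloor + 1$, $A = \{x \in X \mid d(x, B) \le \varepsilon\}$ and $A^c = \{x \in X \mid d(x, B) > \varepsilon\}$. We denote $\mu_{n_1} := \frac{1}{n_1} \sum_{m=1}^{n_1} z_m(x_0, \sigma_0)$. By property of the KR distance, there exists a coupling $\gamma \in \Delta(X \times X)$ such that the first marginal of $\gamma$ is $\mu_{n_1}$, the second marginal is $\mu^*$, and

$$d_{KR}(\mu_{n_1}, \mu^*) = \int_{X \times X} d(x, x')\,\gamma(dx, dx').$$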

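By definition of $A$, for all $(x, x') \in A^c \times B$, we have $d(x, x') > \varepsilon$. Thus, Markov inequality yields

$$\int_{X \times X} d(x, x')\,\gamma(dx, dx') \ge \varepsilon\,\gamma(A^c \times B) = \varepsilon\,\mu_{n_1}(A^c).$$

We deduce that $\mu_{n_1}(A^c) \le \varepsilon^2$. Because the $n_2$ first stages have a weight of order $\varepsilon$ in $\mu_{n_1}$, we deduce the existence of a stage $m$ such that $z_m(A^c) \le \varepsilon$:

$$
\begin{aligned}
\mu_{n_1}(A^c) &= \frac{1}{n_1} \sum_{m=1}^{n_1} z_m(A^c) \\
&= \frac{1}{n_1} \sum_{m=1}^{n_2} z_m(A^c) + \frac{1}{n_1} \sum_{m=n_2+1}^{n_1} z_m(A^c) \\
&\ge \varepsilon \min_{1 \le m \le n_2} z_m(A^c),
\end{aligned}
$$

and thus

$$z_{n_3}(A^c) := \min_{1 \le m \le n_2} z_m(A^c) \le \varepsilon. \qquad (4)$$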
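The averaging argument that leads to (4) can be checked numerically. The Python sketch below is an illustration only: the function name `switching_stage` and the sample sequence `z` are hypothetical, `z[m-1]` stands for $z_m(A^c)$, and the `min(n1, ...)` guard is an addition for safety that the proof does not need. The sample sequence is the least favourable one for this bound, with all the mass outside $A$ concentrated on the first $n_2$ stages.

```python
import math


def switching_stage(z, eps):
    """Return (n3, n2): n3 minimises z[m-1] over stages m = 1..n2, with n2 = floor(eps * n1) + 1.

    z[m-1] stands for z_m(A^c), the probability of being outside A at stage m.
    """
    n1 = len(z)
    # Guard (not needed in the proof): never look beyond the n1 available stages.
    n2 = min(n1, math.floor(eps * n1) + 1)
    n3 = min(range(1, n2 + 1), key=lambda m: z[m - 1])
    return n3, n2


# Hypothetical sequence with n1 = 100 and average at most eps**2.
eps = 0.2
z = [0.19] * 21 + [0.0] * 79
average = sum(z) / len(z)

n3, n2 = switching_stage(z, eps)
# The averaging argument gives z[n3 - 1] <= average / eps <= eps.
best = z[n3 - 1]
```

Even in this case, where the occupation measure puts all its mass outside $A$ on the early stages, the selected stage satisfies the bound of (4).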

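Moreover, $\hat v(z_{n_3}(x_0, \sigma_0))$ is greater than $v(x_0)$ up to a margin of order $\varepsilon$. Indeed we have

$$
\begin{aligned}
\hat v(z_{n_3}(x_0, \sigma_0)) &\ge \hat v_{n_1 - n_3 + 1}(z_{n_3}(x_0, \sigma_0)) - \varepsilon \\
&\ge v_{n_1}(x_0) - \frac{n_3}{n_1} - \varepsilon \\
&\ge v(x_0) - 2\varepsilon - \varepsilon \\
&\ge v(x_0) - 3\varepsilon.
\end{aligned}
$$

Using Equation (4) and the last inequality, we deduce that

$$E^{x_0}_{\sigma_0}\left( 1_A(x_{n_3})\, v(x_{n_3}) \right) \ge E^{x_0}_{\sigma_0}\left( v(x_{n_3}) \right) - z_{n_3}(A^c) \ge v(x_0) - 4\varepsilon.$$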

We have defined both the initial strategy $\sigma_0$ and the switching stage $n_3$. To conclude, we use Lemma 3 in order to define the strategy from stage $n_3$. Note that in Lemma 3 we did not prove that the strategy $\sigma$ could be selected in a measurable way with respect to the state. Thus, we need to use a finite approximation. The set $X$ is a compact metric set, thus there exists a partition $\{P^1, \dots, P^L\}$ of $X$ such that for every $l \in \{1, \dots, L\}$, $P^l$ is measurable and $\mathrm{diam}(P^l) \le \varepsilon$. It follows that there exists a finite subset $\{x^1, \dots, x^L\}$ of $B$ such that for every $x \in A \cap P^l$, $d(x, x^l) \le 3\varepsilon$. We denote by $\psi$ the map which associates to every $x \in A \cap P^l$ the state $x^l$.

We define the strategy $\sigma$ as follows:

• Play $\sigma_0$ until stage $n_3$.

• If $x_{n_3} \in A$, then there exists $l \in \{1, \dots, L\}$ such that $x_{n_3} \in P^l$. Play the strategy given by Lemma 3 with $x = x^l$ and $y = x_{n_3}$. If $x_{n_3} \notin A$, play any strategy.

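Let us check that the strategy $\sigma$ guarantees a good payoff with respect to the long-run average payoff criterion. By definition, we have

$$
\begin{aligned}
E^{x_0}_{\sigma}\left( \liminf_{n \to +\infty} \frac{1}{n} \sum_{m=1}^{n} r_m \right) &\ge E^{x_0}_{\sigma_0}\left( \left[ v(x_{n_3}) - 2\eta\big(d(x_{n_3}, \psi(x_{n_3}))\big) - \varepsilon \right] 1_A(x_{n_3}) \right) \\
&\ge v(x_0) - 5\varepsilon - 2\eta(3\varepsilon).
\end{aligned}
$$

Because $\eta(0) = 0$ and $\eta$ is continuous at 0, the gambling house $\Gamma(x_0)$ has a pathwise uniform value, and Theorem 1 is proved. □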


4 Proofs of Theorem 2, Theorem 3 and Theorem 4

This section is dedicated to the proofs of Theorem 2, Theorem 3 and Theorem 4. Theorem 2 and Theorem 3 stem from Theorem 1. Theorem 4 is not a corollary of Theorem 1. Indeed, applying Theorem 1 to the framework of POMDPs would only yield the existence of the uniform value in pure strategies, and not the existence of the pathwise uniform value.

4.1 Proof of Theorem 2

Let $\Gamma := (X, F, r)$ be a gambling house such that $F$ is 1-Lipschitz. Without loss of generality, we can assume that $r$ is 1-Lipschitz. Indeed, any continuous payoff function can be uniformly approximated by Lipschitz payoff functions, and dividing the payoff function by a constant does not change the decision problem.

In order to prove Theorem 2, it is sufficient to prove that for all $n \ge 1$, $v_n$ is 1-Lipschitz, and $w_\infty$ is 1-Lipschitz. Indeed, it implies that the family $\{v_n, n \ge 1\}$ is uniformly equicontinuous and $w_\infty$ is continuous. Theorem 2 then stems from Theorem 1.

Recall that $G : X \rightrightarrows \Delta(X)$ is defined for all $x \in X$ by $G(x) := \mathrm{sco}\,F(x)$.

Lemma 4. The correspondence $G$ is 1-Lipschitz.

Proof. Let $x$ and $x'$ be two states in $X$. Fix $\mu \in G(x)$. Let us show that there exists $\mu' \in G(x')$ such that $d_{KR}(\mu, \mu') \le d(x, x')$.

By definition of $G(x)$, there exists $\nu \in \Delta(F(x))$ such that for all $g \in C(X, [0,1])$,

$$\hat g(\mu) = \int_{\Delta(X)} \hat g(z)\,\nu(dz).$$

Let $M = F(x) \subset \Delta(X)$. We consider the correspondence $\Phi : M \rightrightarrows \Delta(X)$ defined for $z \in M$ by

$$\Phi(z) := \{z' \in F(x') \mid d_{KR}(z, z') \le d(x, x')\}.$$

Because $F$ is 1-Lipschitz, $\Phi$ has nonempty values. Moreover, $\Phi$ is the intersection of two correspondences with a closed graph, therefore it is a correspondence with a closed graph. Applying Proposition 2, we deduce that $\Phi$ has a measurable selector $\varphi : M \to \Delta(X)$.

Let $\nu' \in \Delta(\Delta(X))$ be the image measure of $\nu$ by $\varphi$. Throughout the paper, we use the following notation for image measures:

$$\nu' := \nu \circ \varphi^{-1}.$$

By construction, $\nu'(F(x')) = 1$ and for all $h \in C(\Delta(X), [0,1])$,

$$\int_{\Delta(X)} h(\varphi(z))\,\nu(dz) = \int_{\Delta(X)} h(u)\,\nu'(du).$$

Let $\mu' := \mathrm{Bar}(\nu')$ and $f \in E_1$. The function $\hat f$ is 1-Lipschitz, and

$$
\begin{aligned}
\left| \hat f(\mu) - \hat f(\mu') \right| &= \left| \int_{\Delta(X)} \hat f(z)\,\nu(dz) - \int_{\Delta(X)} \hat f(u)\,\nu'(du) \right| \\
&= \left| \int_{\Delta(X)} \hat f(z)\,\nu(dz) - \int_{\Delta(X)} \hat f(\varphi(z))\,\nu(dz) \right| \\
&\le \int_{\Delta(X)} \left| \hat f(z) - \hat f(\varphi(z)) \right| \nu(dz) \\
&\le d(x, x').
\end{aligned}
$$

□

Because $G$ is 1-Lipschitz, given $(x, u) \in \mathrm{Graph}\,G$ and $y \in X$, there exists $w \in G(y)$ such that $d_{KR}(u, w) \le d(x, y)$. For our purpose, we need that the optimal coupling between $u$ and $w$ can be selected in a measurable way. This is the aim of the following lemma:

Lemma 5. There exists a measurable mapping $\psi : \mathrm{Graph}\,G \times X \to \Delta(X \times X)$ such that for all $(x, u) \in \mathrm{Graph}\,G$, for all $y \in X$,

• the first marginal of $\psi(x, u, y)$ is $u$,

• the second marginal of $\psi(x, u, y)$ is in $G(y)$,

• $\int_{X \times X} d(s, t)\,\psi(x, u, y)(ds, dt) \le d(x, y)$.

Proof. Let $S := \mathrm{Graph}(G) \times X$, $X' := \Delta(X \times X)$ and $\Xi : S \rightrightarrows X'$ the correspondence defined for all $(x, u, y) \in S$ by

$$\Xi(x, u, y) = \{\pi \in \Delta(X \times X) \mid \pi_1 = u,\ \pi_2 \in G(y)\},$$

where $\pi_1$ (resp. $\pi_2$) denotes the first (resp. second) marginal of $\pi$. The correspondence $\Xi$ has a closed graph. Let $f : X' \to \mathbb{R}$ be defined by

$$f(\pi) = \int_{X \times X} d(s, t)\,\pi(ds, dt).$$

The function $f$ is continuous. Applying the measurable maximum theorem (see [?, Theorem 18.19, p. 605]), we obtain that the correspondence $s \mapsto \operatorname{argmin}_{\pi \in \Xi(s)} f(\pi)$ has a measurable selector, which proves the lemma. □
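To make the Kantorovich-Rubinstein distance and its two descriptions (optimal coupling cost, supremum over 1-Lipschitz test functions) concrete, here is a small Python sketch. It relies on assumptions that are not part of the paper: the space is a finite subset of the real line, where the optimal coupling cost is known to equal the integral of the absolute difference of the cumulative distribution functions, and the names `kr_distance_on_line`, `points`, `p`, `q` are hypothetical.

```python
def kr_distance_on_line(points, p, q):
    """KR (Wasserstein-1) distance between two laws p, q on sorted real points.

    On the real line the distance equals the integral of |F_p - F_q|,
    where F_p and F_q are the cumulative distribution functions.
    """
    total, cdf_p, cdf_q = 0.0, 0.0, 0.0
    for j in range(len(points) - 1):
        cdf_p += p[j]
        cdf_q += q[j]
        total += abs(cdf_p - cdf_q) * (points[j + 1] - points[j])
    return total


def expectation(f, points, law):
    """Expectation of f under a finitely supported law."""
    return sum(f(x) * w for x, w in zip(points, law))


# Hypothetical laws on three points of [0, 1].
points = [0.0, 0.5, 1.0]
p = [0.5, 0.5, 0.0]
q = [0.0, 0.5, 0.5]
distance = kr_distance_on_line(points, p, q)

# Dual side: every 1-Lipschitz test function separates p and q by at most the distance.
lipschitz_tests = [lambda x: x, lambda x: abs(x - 0.5), lambda x: min(x, 0.5)]
gaps = [abs(expectation(f, points, p) - expectation(f, points, q)) for f in lipschitz_tests]
```

Here the identity function attains the distance, while the other two test functions give strictly smaller gaps. On a general compact metric space no such closed formula is available, which is why the proof goes through a measurable selection argument.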


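Proposition 6. Let $x, y \in X$ and $\sigma$ be a strategy. Then there exist a probability measure $P^{x,y}_{\sigma}$ on $H_\infty \times H_\infty$, and a strategy $\tau$ such that:

• $P^{x,y}_{\sigma}$ has first marginal $P^{x}_{\sigma}$,

• $P^{x,y}_{\sigma}$ has second marginal $P^{y}_{\tau}$,

• The following inequalities hold: for every $n \ge 1$,

$$E^{x,y}_{\sigma}\left( \frac{1}{n} \sum_{m=1}^{n} |r(X_m) - r(Y_m)| \right) \le d(x, y),$$

and

$$E^{x,y}_{\sigma}\left( \limsup_{n \to +\infty} \frac{1}{n} \sum_{m=1}^{n} |r(X_m) - r(Y_m)| \right) \le d(x, y),$$

where $X_m$ (resp. $Y_m$) is the $m$-th coordinate of the first (resp. second) infinite history.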

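Proof. Define the stochastic process $(X_m, Y_m)_{m \ge 0}$ on $H_\infty \times H_\infty$ such that the conditional distribution of $(X_m, Y_m)$ knowing $(X_l, Y_l)_{0 \le l \le m-1}$ is

$$\psi\big(X_{m-1},\ \sigma(X_0, \dots, X_{m-1}),\ Y_{m-1}\big),$$

with $\psi$ defined as in Lemma 5. Let $P^{x,y}_{\sigma}$ be the law on $H_\infty \times H_\infty$ induced by this stochastic process and the initial distribution $\delta_{(x,y)}$. By construction, the first marginal of $P^{x,y}_{\sigma}$ is $P^{x}_{\sigma}$.

For $m \in \mathbb{N}^*$ and $(y_0, \dots, y_{m-1}) \in X^m$, define $\tau_m(y_0, \dots, y_{m-1}) \in \Delta(X)$ as being the law of $Y_m$, conditional to $Y_0 = y_0, \dots, Y_{m-1} = y_{m-1}$. By convexity of $G$, this defines a (behavior) strategy $\tau$ in the game $\Gamma$. Moreover, the probability measure $P^{y}_{\tau}$ is equal to the second marginal of $P^{x,y}_{\sigma}$.

For all $m \in \mathbb{N}^*$, we have $P^{x,y}_{\sigma}$-almost surely

$$
\begin{aligned}
E^{x,y}_{\sigma}\big( d(X_m, Y_m) \mid X_{m-1}, Y_{m-1} \big) &= \int_{X \times X} d(s', t')\,\psi\big(X_{m-1}, \sigma(X_0, \dots, X_{m-1}), Y_{m-1}\big)(ds', dt') \\
&\le d(X_{m-1}, Y_{m-1}).
\end{aligned}
$$

The random process $(d(X_m, Y_m))_{m \ge 0}$ is a positive supermartingale. Therefore, we have

$$
\begin{aligned}
E^{x,y}_{\sigma}\left( \frac{1}{n} \sum_{m=1}^{n} |r(X_m) - r(Y_m)| \right) &\le E^{x,y}_{\sigma}\left( \frac{1}{n} \sum_{m=1}^{n} d(X_m, Y_m) \right) \\
&= \frac{1}{n} \sum_{m=1}^{n} E^{x,y}_{\sigma}\big( d(X_m, Y_m) \big) \\
&\le d(x, y).
\end{aligned}
$$

Moreover, the random process $(d(X_m, Y_m))_{m \ge 0}$ converges $P^{x,y}_{\sigma}$-almost surely to a random variable $D$, such that $E^{x,y}_{\sigma}(D) \le d(x, y)$. For every $n \ge 1$, we have

$$\frac{1}{n} \sum_{m=1}^{n} |r(X_m) - r(Y_m)| \le \frac{1}{n} \sum_{m=1}^{n} d(X_m, Y_m),$$

and the Cesaro theorem yields

$$\limsup_{n \to +\infty} \frac{1}{n} \sum_{m=1}^{n} |r(X_m) - r(Y_m)| \le D \quad P^{x,y}_{\sigma}\text{-a.s.}$$

Integrating the last inequality yields the proposition. □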

Proposition 6 implies that for all $n \ge 1$, $v_n$ is 1-Lipschitz, and that $w_\infty$ is 1-Lipschitz. Thus, Theorem 2 holds.


4.2 Proof of Theorem 3 for MDPs

In this subsection, we consider an MDP $\Gamma = (K, I, g, q)$, as described in Subsection 2.2: the state space $(K, d_K)$ and the action set $(I, d_I)$ are compact metric, and the transition function $q$ and the payoff function $g$ are continuous. As in the previous section, without loss of generality we assume that the payoff function $g$ is in fact 1-Lipschitz.


In the model of gambling house, there is no explicit set of actions. In order to apply Theorem 1 to $\Gamma$, we put the action played in the state variable. Indeed, we consider an auxiliary gambling house $\tilde\Gamma$, with state space $K \times I \times K$. At each stage $m \ge 1$, the state $x_m$ in the gambling house corresponds to the triple $(k_m, i_m, k_{m+1})$ in the MDP. Formally, $\tilde\Gamma$ is defined as follows:

• The state space is $X := K \times I \times K$, equipped with the distance $d$ defined by

$$\forall (k, i, l), (k', i', l') \in X, \quad d\big((k, i, l), (k', i', l')\big) = \max\big(d_K(k, k'),\ d_I(i, i'),\ d_K(l, l')\big).$$

• The payoff function $r : X \to [0,1]$ is defined by: for all $(k, i, k') \in X$, $r(k, i, k') := g(k, i)$.

• The correspondence $F : X \rightrightarrows \Delta(X)$ is defined by:

$$\forall (k, i, k') \in K \times I \times K, \quad F(k, i, k') := \{\delta_{k', i'} \otimes q(k', i') : i' \in I\},$$

where $\delta_{k', i'}$ is the Dirac measure at $(k', i')$, and the symbol $\otimes$ stands for product measure.

Fix some arbitrary state $k_0 \in K$ and some arbitrary action $i_0 \in I$. Given an initial state $k_1$ in the MDP $\Gamma$, the corresponding initial state $x_0$ in the gambling house $\tilde\Gamma$ is $(k_0, i_0, k_1)$. By construction, the payoff at stage $m$ in $\tilde\Gamma(x_0)$ corresponds to the payoff at stage $m$ in $\Gamma(k_1)$.
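The construction above can be written out explicitly when $K$ and $I$ are finite. The following Python sketch is an illustration under that finiteness assumption, which is stronger than the compactness assumed in the text. The function name `auxiliary_gambling_house`, the dictionary encoding of $g$ and $q$, and the two-state example are all hypothetical.

```python
from itertools import product


def auxiliary_gambling_house(states, actions, g, q):
    """Build (X, r, F) from a finite MDP.

    g[(k, i)] is the stage payoff and q[(k, i)] is a dict k'' -> probability.
    """
    X = list(product(states, actions, states))
    r = {(k, i, k_next): g[(k, i)] for (k, i, k_next) in X}
    F = {}
    for (k, i, k_next) in X:
        # One available lottery per action i': the Dirac mass at (k_next, i')
        # times the transition q(k_next, i') for the third coordinate.
        F[(k, i, k_next)] = [
            {(k_next, i_next, k_after): prob for k_after, prob in q[(k_next, i_next)].items()}
            for i_next in actions
        ]
    return X, r, F


# Hypothetical two-state, two-action MDP.
states, actions = ["a", "b"], [0, 1]
g = {("a", 0): 1.0, ("a", 1): 0.0, ("b", 0): 0.0, ("b", 1): 0.5}
q = {
    ("a", 0): {"a": 0.9, "b": 0.1},
    ("a", 1): {"b": 1.0},
    ("b", 0): {"a": 0.5, "b": 0.5},
    ("b", 1): {"b": 1.0},
}
X, r, F = auxiliary_gambling_house(states, actions, g, q)
```

The set of lotteries available at $(k, i, k')$ depends only on the last coordinate $k'$, and the payoff only on the first two. This is how choosing a lottery in the gambling house encodes choosing an action in the MDP.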

Now let us check the assumptions of Theorem 1. The state space $X$ is compact metric. Because $g$ is continuous, $r$ is continuous, and the following lemma holds:

Lemma 6. The correspondence $F$ has a closed graph.

Proof. Let $(x_n, u_n)_{n \in \mathbb{N}} \in (\mathrm{Graph}\,F)^{\mathbb{N}}$ be a convergent sequence. By definition of $F$, for every $n \ge 1$, there exist $(k_n, i_n, k'_n) \in K \times I \times K$ and $i'_n \in I$ such that

$$x_n = (k_n, i_n, k'_n)$$

and

$$u_n = \delta_{k'_n, i'_n} \otimes q(k'_n, i'_n).$$

Moreover, the sequence $(k_n, i_n, k'_n, i'_n)_{n \ge 1}$ converges to some $(k, i, k', i') \in K \times I \times K \times I$. Because the transition $q$ is jointly continuous, we obtain that $(u_n)$ converges to $\delta_{k', i'} \otimes q(k', i')$, which is indeed in $F(k, i, k')$. □

We now prove that for all $n \in \mathbb{N}^*$, $v_n$ is 1-Lipschitz, and that $w_\infty$ is 1-Lipschitz. It is more convenient to prove this result in the MDP $\Gamma$, rather than in the gambling house $\tilde\Gamma$. Thus, in the next proposition, $H_\infty = (K \times I)^\infty$ is the set of infinite histories in $\Gamma$, a strategy $\sigma$ is a map from $\bigcup_{m \ge 1} K \times (I \times K)^{m-1}$ to $\Delta(I)$, and $P^{k_1}_{\sigma}$ denotes the probability over $H_\infty$ generated by the pair $(k_1, \sigma)$. This proposition is similar to Proposition 6.

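Proposition 7. Let $k_1, k'_1 \in K$ and $\sigma$ be a strategy. Then there exist a probability measure $P^{k_1, k'_1}_{\sigma}$ on $H_\infty \times H_\infty$, and a strategy $\tau$ such that:

• $P^{k_1, k'_1}_{\sigma}$ has first marginal $P^{k_1}_{\sigma}$,

• $P^{k_1, k'_1}_{\sigma}$ has second marginal $P^{k'_1}_{\tau}$,

• The following inequalities hold: for every $n \ge 1$,

$$E^{k_1, k'_1}_{\sigma}\left( \frac{1}{n} \sum_{m=1}^{n} |g(K_m, I_m) - g(K'_m, I'_m)| \right) \le d_K(k_1, k'_1),$$

and

$$E^{k_1, k'_1}_{\sigma}\left( \limsup_{n \to +\infty} \frac{1}{n} \sum_{m=1}^{n} |g(K_m, I_m) - g(K'_m, I'_m)| \right) \le d_K(k_1, k'_1),$$

where $K_m, I_m$ (resp. $K'_m, I'_m$) is the $m$-th coordinate of the first (resp. second) infinite history.

• Under $P^{k_1, k'_1}_{\sigma}$, for all $m \ge 1$, $I_m = I'_m$.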

Proof. Exactly as in Lemma 5, one can construct a measurable mapping $\psi : K \times K \times I \to \Delta(K \times K)$ such that for all $(k, k', i) \in K \times K \times I$, $\psi(k, k', i) \in \Delta(K \times K)$ is an optimal coupling between $q(k, i)$ and $q(k', i)$ for the KR distance.

We define a stochastic process on $I \times K \times I \times K$ in the following way: given an arbitrary action $i_0$, we set $I_0 = I'_0 = i_0$, $K_1 = k_1$, $K'_1 = k'_1$. Then, for all $m \ge 1$, given $(I_{m-1}, K_m, I'_{m-1}, K'_m)$, we construct $(I_m, K_{m+1}, I'_m, K'_{m+1})$ as follows:

• $I_m$ is drawn from $\sigma(K_1, I_1, \dots, K_m)$,

• $(K_{m+1}, K'_{m+1})$ is drawn from $\psi(K_m, K'_m, I_m)$,

• we set $I'_m := I_m$.

By construction, $P^{k_1, k'_1}_{\sigma}$ has first marginal $P^{k_1}_{\sigma}$. For $m \ge 1$ and $h_m = (k'_1, i'_1, \dots, k'_m) \in H_m$, define $\tau(h_m) \in \Delta(I)$ as being the law of $I'_m$, conditional to $K'_1 = k'_1, I'_1 = i'_1, \dots, K'_m = k'_m$. This defines a strategy $\tau$. Moreover, for all $m \ge 1$, we have

$$E^{k_1, k'_1}_{\sigma}\big( d_K(K_{m+1}, K'_{m+1}) \mid K_m, K'_m \big) \le d_K(K_m, K'_m).$$

The process $(d_K(K_m, K'_m))_{m \ge 1}$ is a positive supermartingale, thus it converges almost surely. We conclude exactly as in the proof of Proposition 6. □

The previous proposition implies that the value functions $v_n$ and $w_\infty$ are 1-Lipschitz. Therefore, the family $\{v_n, n \ge 1\}$ is equicontinuous, and $w_\infty$ is continuous. By Theorem 1, the gambling house $\tilde\Gamma$ has a pathwise uniform value in pure strategies. It follows that the MDP $\Gamma$ has a pathwise uniform value in pure strategies, and Theorem 3 holds.

Remark 4. Renault and Venel [?] define the auxiliary gambling house associated to an MDP slightly differently. Instead of taking $K \times I \times K$ as the auxiliary state space, they take $[0,1] \times K$, where the first component represents the stage payoff. In our framework, applying this method would lead to a measurability problem when trying to transform a strategy in the auxiliary gambling house into a strategy in the MDP.

4.3 Proof of Theorem 4 for POMDPs

In this subsection, we consider a POMDP $\Gamma = (K, I, S, g, q)$, as described in Subsection 2.3: the state space $K$ and the signal space $S$ are finite, the action set $(I, d_I)$ is compact metric, and the transition function $q$ and the payoff function $g$ are continuous.


A standard way to analyze $\Gamma$ is to consider the belief $p_m \in \Delta(K)$ at stage $m$ about the state as a new state variable, and thus consider an auxiliary problem in which the state is perfectly observed and lies in $\Delta(K)$ (see [?], [?], [?]). The function $g$ is linearly extended to $\Delta(K) \times \Delta(I)$, in the following way: for all $(p, u) \in \Delta(K) \times \Delta(I)$,

$$g(p, u) := \sum_{k \in K} p^k \int_I g(k, i)\,u(di).$$

Let $\tilde q : \Delta(K) \times I \to \Delta(\Delta(K))$ be the transition on the beliefs about the state, induced by $q$: if at some stage of the game, the belief of the decision-maker is $p$, and he plays the action $i$, then his belief about the next state will be distributed according to $\tilde q(p, i)$. We extend linearly the transition $\tilde q$ on $\Delta(K) \times \Delta(I)$, in the following way: for all $f \in C(\Delta(K), [0,1])$,

$$\int_{\Delta(K)} f(p')\,[\tilde q(p, u)](dp') = \int_I \left( \int_{\Delta(K)} f(p')\,[\tilde q(p, i)](dp') \right) u(di).$$

We can also define an auxiliary gambling house $\tilde\Gamma$, with state space $[0,1] \times I \times \Delta(K)$: at stage $m$, the auxiliary state $x_m$ corresponds to the triple $(g(p_m, i_m), i_m, p_{m+1})$. Formally, the gambling house $\tilde\Gamma$ is defined as follows:


• State space $X := [0,1] \times I \times \Delta(K)$: the set $\Delta(K)$ is equipped with the norm $\|\cdot\|_1$, and the distance $d$ on $X$ is $d := \max(|\cdot|, d_I, \|\cdot\|_1)$.

• Payoff function $r : X \to [0,1]$ such that for all $x = (a, i, p) \in X$, $r(x) := a$.

• Correspondence $F : X \rightrightarrows \Delta(X)$ defined for all $x = (a, i, p) \in X$ by
$$F(x) := \{\delta_{g(p, i')} \otimes \delta_{i'} \otimes \tilde q(p, i') : i' \in I\}.$$

Fix some arbitrary $a_0 \in [0,1]$ and $i_0 \in I$. To each initial belief $p_1 \in \Delta(K)$ in $\Gamma$, we associate an initial state $x_0(p_1)$ in $\tilde\Gamma$ by:

$$x_0(p_1) := (a_0, i_0, p_1).$$

By construction, the payoff at stage $m$ in the auxiliary gambling house $\tilde\Gamma(x_0(p_1))$ corresponds to the payoff $g(p_m, i_m)$ in the POMDP $\Gamma(p_1)$. In particular, for all $n \in \mathbb{N}^*$, the value of the $n$-stage gambling house $\tilde\Gamma_n(x_0(p_1))$ coincides with the value of the $n$-stage POMDP $\Gamma_n(p_1)$, which is denoted by $v_n(p_1)$.

One could check that $\tilde\Gamma$ satisfies the assumptions of Theorem 1 and therefore has a pathwise uniform value. This would especially imply that $\tilde\Gamma$ has a uniform value in pure strategies, and it would prove that $\Gamma$ has a uniform value in pure strategies. Indeed, let $p_1 \in \Delta(K)$ and $\tilde\sigma$ be a strategy in $\tilde\Gamma(x_0(p_1))$. Let $\sigma$ be the associated strategy in the POMDP $\Gamma(p_1)$. For all $n \ge 1$, we have

$$E^{x_0(p_1)}_{\tilde\sigma}\left( \frac{1}{n} \sum_{m=1}^{n} r(x_m) \right) = E^{p_1}_{\sigma}\left( \frac{1}{n} \sum_{m=1}^{n} g(k_m, i_m) \right).$$

Consequently, the fact that $\tilde\Gamma(x_0(p_1))$ has a uniform value in pure strategies implies that $\Gamma(p_1)$ also has a uniform value in pure strategies.

Unfortunately, this approach does not prove Theorem 4, i.e. the existence of the pathwise uniform value in $\Gamma$, due to the following problem:

Problem. It may happen that

$$E^{x_0(p_1)}_{\tilde\sigma}\left( \liminf_{n \to +\infty} \frac{1}{n} \sum_{m=1}^{n} r(x_m) \right) > E^{p_1}_{\sigma}\left( \liminf_{n \to +\infty} \frac{1}{n} \sum_{m=1}^{n} g(k_m, i_m) \right).$$

Indeed, $r(x_m)$ is not equal to $g(k_m, i_m)$: it is the expectation of $g(k_m, i_m)$ with respect to $p_m$. Consequently, the fact that $\tilde\sigma$ is a pathwise $\varepsilon$-optimal strategy in $\tilde\Gamma(x_0(p_1))$ does not imply that $\sigma$ is a pathwise $\varepsilon$-optimal strategy in $\Gamma(p_1)$.


To prove Theorem 4, we adapt the proof of Theorem 1 to the framework of POMDPs. Recall that the proof of Theorem 1 was decomposed into three lemmas (Lemmas 1, 2 and 3) and a conclusion (Subsection 3.5). We adapt the three lemmas, and the conclusion is similar.


In order to obtain the first lemma, we check that $F$ has a closed graph.

Proposition 8. The correspondence $F$ has a closed graph.

Proof. Let $(x_n, u_n)_{n \in \mathbb{N}} \in (\mathrm{Graph}\,F)^{\mathbb{N}}$ be a sequence that converges to $(x, u) \in X \times \Delta(X)$. By definition of $F$, for every $n \ge 1$ there exists $(a_n, i_n, p_n, i'_n) \in [0,1] \times I \times \Delta(K) \times I$ such that

$$x_n = (a_n, i_n, p_n)$$

and

$$u_n = \delta_{g(p_n, i'_n)} \otimes \delta_{i'_n} \otimes \tilde q(p_n, i'_n).$$

It follows that the sequence $(a_n, i_n, p_n, i'_n)_{n \ge 1}$ converges to some $(a, i, p, i') \in [0,1] \times I \times \Delta(K) \times I$ and $x = (a, i, p)$.

By Feinberg [?, Theorem 3.2], the function $\tilde q$ is jointly continuous. Because the payoff function $g$ is also continuous, we obtain that $(u_n)$ converges to $u = \delta_{g(p, i')} \otimes \delta_{i'} \otimes \tilde q(p, i')$, which is indeed in $F(x)$. □

Now we can apply Lemma 1 to the gambling house $\tilde\Gamma$. For $p \in \Delta(K)$, define $v(p) := \limsup_{n \to +\infty} v_n(p)$. Note that for all $x = (a, i, p) \in X$, the set $F(x)$ depends only on the third component $p$. Thus, Lemma 1 implies the following lemma for the POMDP $\Gamma$:

Lemma 7. Let $p_1 \in \Delta(K)$. There exist a distribution $\mu^* \in \Delta(\Delta(K))$ and a stationary strategy $\sigma^* : \Delta(K) \to \Delta(I)$ such that

• $\mu^*$ is $\sigma^*$-invariant: for all $f \in C(\Delta(K), [0,1])$,

$$\int_{\Delta(K)} \hat f\big(\tilde q(p, \sigma^*(p))\big)\,\mu^*(dp) = \hat f(\mu^*),$$

• For every $\varepsilon > 0$ and $N \ge 1$, there exist a (pure) strategy $\sigma$ in $\Gamma$ and $n \ge N$ such that $\sigma$ is 0-optimal in $\Gamma_n(p_1)$ and

$$d_{KR}\left( \frac{1}{n} \sum_{m=1}^{n} z_m(p_1, \sigma),\ \mu^* \right) \le \varepsilon,$$

where $z_m(p_1, \sigma)$ is the distribution over $\Delta(K)$ at stage $m$, starting from $p_1$,

• $$\int_{\Delta(K)} g(p, \sigma^*(p))\,\mu^*(dp) = \hat v(\mu^*) = v(p_1).$$



We can now state a new lemma about pathwise convergence in $\Gamma$. This replaces Lemma 2.

Lemma 8. Let $p_1 \in \Delta(K)$ and $\mu^*$ be the corresponding measure in the previous lemma. There exists a measurable set $B \subset \Delta(K)$ such that $\mu^*(B) = 1$ and for all $p \in B$,

$$E^{p}_{\sigma^*}\left( \liminf_{n \to +\infty} \frac{1}{n} \sum_{m=1}^{n} g(k_m, i_m) \right) = v(p).$$

Proof. It is not enough to apply Birkhoff's theorem to the Markov chain of beliefs $(p_m)_{m \ge 1}$, due to the problem mentioned previously. Instead, we consider the random process $(y_m)_{m \ge 1}$ on $Y := K \times I \times \Delta(K)$, defined for all $m \ge 1$ by $y_m := (k_m, i_m, p_m)$ (current state, action played, belief about the current state). Under $P^{\mu^*}_{\sigma^*}$, this is a Markov chain. Indeed, given $m \ge 1$ and $(y_1, y_2, \dots, y_m) \in Y^m$, the next state $y_{m+1}$ is generated in the following way:

• a pair $(k_{m+1}, s_m)$ is drawn from $q(k_m, i_m)$,

• the decision-maker computes the new belief $p_{m+1}$ according to $p_m$ and $s_m$,

• the decision-maker draws an action $i_{m+1}$ from $\sigma^*(p_{m+1})$.

By construction, the law of $y_{m+1}$ depends only on $y_m$, and $(y_m)_{m \ge 1}$ is a Markov chain. Define $\nu^* \in \Delta(Y)$ such that the third marginal of $\nu^*$ is $\mu^*$, and for all $p \in \Delta(K)$, the conditional law $\nu^*(\cdot \mid p) \in \Delta(K \times I)$ is $p \otimes \sigma^*(p)$. Under $P^{\mu^*}_{\sigma^*}$, for all $m \ge 1$, the third component $p_m$ of $y_m$ is distributed according to $\mu^*$. Moreover, conditional on $p_m$, the random variables $k_m$ and $i_m$ are independent, the conditional distribution of $k_m$ knowing $p_m$ is $p_m$, and the conditional distribution of $i_m$ knowing $p_m$ is $\sigma^*(p_m)$. Thus, $\nu^*$ is an invariant measure for the Markov chain $(y_m)_{m \ge 1}$. Define a measurable map $f : Y \to [0,1]$ by: for all $(k, i, p) \in Y$, $f(k, i, p) = g(k, i)$. Now we can apply Theorem 5 to $(y_m)_{m \ge 1}$, and deduce that there exist $B_0 \subset \Delta(K)$ and $w : Y \to [0,1]$ such that for all $p \in B_0$,

$$\frac{1}{n} \sum_{m=1}^{n} f(y_m) \xrightarrow[n \to +\infty]{} w(k_1, i_1, p) \quad P^{p}_{\sigma^*}\text{-almost surely,} \qquad (5)$$

and

$$\hat w(\nu^*) = \hat f(\nu^*).$$

By definition of $f$, for all $m \ge 1$, $f(y_m) = g(k_m, i_m)$. Moreover, by definition of $\nu^*$, we have

$$\hat f(\nu^*) = \int_{\Delta(K)} g(p, \sigma^*(p))\,\mu^*(dp),$$

and by Lemma 7, we deduce that $\hat f(\nu^*) = \hat v(\mu^*)$. Consequently, $\hat w(\nu^*) = \hat v(\mu^*)$. Given $p \in B_0$, denote by $w_0(p)$ the expectation of $w(\cdot, p)$ with respect to $P^{p}_{\sigma^*}$. By Equation (5), we have

$$E^{p}_{\sigma^*}\left( \lim_{n \to +\infty} \frac{1}{n} \sum_{m=1}^{n} g(k_m, i_m) \right) = w_0(p).$$

Let us prove that $w_0 = v$ $P^{\mu^*}_{\sigma^*}$-almost surely. Note that $\hat w_0(\mu^*) = \hat w(\nu^*) = \hat v(\mu^*)$. Consequently, it is enough to show that $w_0 \le v$ $P^{\mu^*}_{\sigma^*}$-almost surely. By the dominated convergence theorem and the definition of $v$, we have

$$E^{p}_{\sigma^*}\left( \lim_{n \to +\infty} \frac{1}{n} \sum_{m=1}^{n} g(k_m, i_m) \right) = \lim_{n \to +\infty} E^{p}_{\sigma^*}\left( \frac{1}{n} \sum_{m=1}^{n} g(k_m, i_m) \right) \le v(p),$$


and the lemma is proved. □

For every $n \ge 1$, the value function $v_n$ is 1-Lipschitz, as a consequence of the following proposition.

Proposition 9. Let $p, p' \in \Delta(K)$ and $\sigma$ be a strategy in $\Gamma$. Then, for every $n \ge 1$,

$$\left| E^{p}_{\sigma}\left( \frac{1}{n} \sum_{m=1}^{n} g(k_m, i_m) \right) - E^{p'}_{\sigma}\left( \frac{1}{n} \sum_{m=1}^{n} g(k_m, i_m) \right) \right| \le \|p - p'\|_1.$$

This proposition is proved in Rosenberg, Solan and Vieille [?, Proposition 1]. In their framework, $I$ is finite, but the fact that $I$ is compact does not change the proof at all.

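Last, we establish the junction lemma, which replaces Lemma 3.

Lemma 9. Let $p, p' \in \Delta(K)$ and $\sigma$ be a strategy such that

$$E^{p}_{\sigma}\left( \liminf_{n \to +\infty} \frac{1}{n} \sum_{m=1}^{n} g(k_m, i_m) \right) = v(p).$$

Then, the following inequality holds:

$$E^{p'}_{\sigma}\left( \liminf_{n \to +\infty} \frac{1}{n} \sum_{m=1}^{n} g(k_m, i_m) \right) \ge v(p') - 2\|p - p'\|_1.$$

Proof. Let $k \in K$ and $p_1 \in \Delta(K)$. Denote by $P^{p_1}_{\sigma}(h_\infty \mid k)$ the law of the infinite history $h_\infty \in (K \times I \times S)^\infty$ in the POMDP $\Gamma(p_1)$, under the strategy $\sigma$, and conditional to $k_1 = k$. Then $P^{p}_{\sigma}(h_\infty \mid k) = P^{p'}_{\sigma}(h_\infty \mid k)$ and

$$E^{p'}_{\sigma}\left( \liminf_{n \to +\infty} \frac{1}{n} \sum_{m=1}^{n} g(k_m, i_m) \right) \ge v(p) - \|p - p'\|_1.$$

For every $n \ge 1$, $v_n$ is 1-Lipschitz, thus the function $v$ is also 1-Lipschitz, and the lemma is proved. □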

The conclusion of the proof is similar to Section 3.5. Note that apart from the three main lemmas, the only additional property used in Section 3.5 was that the family $(v_n)_{n \ge 1}$ is uniformly equicontinuous. For every $n \ge 1$, $v_n$ is 1-Lipschitz, thus the family $(v_n)_{n \ge 1}$ is indeed uniformly equicontinuous.